[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK (#3150)
Greptile Summary
This PR bumps the
Confidence Score: 5/5
Safe to merge: the submodule bump is a targeted bug fix, and the shell script changes only tighten the existing skip guards.
The only code changes are a submodule pointer update and consistent NVLink detection guards across four launcher scripts. The new detection method (checking for active link bandwidth via `nvidia-smi nvlink --status`) is strictly more precise than the old topology-matrix check. Its worst failure mode is an over-skip on an unusual hardware configuration, not a hang or data corruption. No TE library code is modified.
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[EP test/bench script starts] --> B{GPU count >= 4?}
B -- No --> C[SKIP: not enough GPUs]
B -- Yes --> D{nvidia-smi nvlink --status\nmatches 'Link N:.*GB/s'?}
D -- No --> E[SKIP: NVLink not active\nPCIe-only or unsupported]
D -- Yes --> F[Run EP test / bench]
F --> G{Exit code == 0?}
G -- Yes --> H[PASS]
G -- No --> I[FAIL]
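The two skip guards in the flowchart can be sketched as a small POSIX shell fragment. This is an illustrative sketch, not the code from the PR's launcher scripts: the function names `has_active_nvlink` and `ep_guard` are hypothetical, and reading the flowchart's `Link N` as "Link followed by digits" is an assumption.

```shell
# Hypothetical sketch of the skip guards; names and exact regex are assumptions.

# Succeeds if the given `nvidia-smi nvlink --status` output contains at least
# one link line that reports a bandwidth, e.g. "Link 0: 26.562 GB/s".
has_active_nvlink() {
  printf '%s\n' "$1" | grep -Eq 'Link [0-9]+:.*GB/s'
}

# Prints RUN, or a SKIP reason, given a GPU count and the nvlink status text.
ep_guard() {
  if [ "$1" -lt 4 ]; then
    echo "SKIP: not enough GPUs"
  elif ! has_active_nvlink "$2"; then
    echo "SKIP: NVLink not active"
  else
    echo "RUN"
  fi
}
```

On a GPU node, the guard could be driven with something like `ep_guard "$(nvidia-smi -L | wc -l)" "$(nvidia-smi nvlink --status 2>/dev/null)"`. Keeping the decision in a function that takes plain text makes it easy to check against captured output without needing the hardware.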
Reviews (6): Last reviewed commit: "Detect active NVLink via nvlink --status..."
/te-ci L1 |
jberchtold-nvidia left a comment:
LGTM, thanks!
Pipeline #56213025
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
/te-ci L1
…NS_PER_RANK (#3150)

* nccl with relax num_dispatch_tokens%64!=0
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Skip EP tests/examples on nodes without NVLink
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Description
Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK